The Relation Between MOS and Pairwise Comparisons and the Importance of Cross-Content Comparisons

Author

  • Emin Zerman
Abstract

Subjective quality assessment is considered a reliable way to assess the quality of distorted stimuli in several multimedia applications. The experimental methods can be broadly categorized into those that rate stimuli and those that rank them. Although ranking directly provides an order of the stimuli rather than a continuous measure of quality, the experimental data can be converted, using scaling methods, into an interval scale similar to that provided by rating methods. In this paper, we compare the results collected in a rating (mean opinion scores) experiment to the scaled results of a pairwise comparison experiment, the most common ranking method. We find a strong linear relationship between the results of the two methods, which, however, differs between contents. To improve the relationship and unify the scale, we extend the experiment to include cross-content comparisons. We find that cross-content comparisons not only reduce the confidence intervals of the pairwise comparison results, but also improve the relationship with mean opinion scores.

Introduction

Subjective quality assessment is used in many domains, including psychology, medical applications, computer graphics, and multimedia. Regardless of the domain, it is regarded as a reliable method of quality assessment and is often employed to collect “ground-truth” quality scores. Two of the main methods of subjective quality assessment for multimedia content are direct rating and ranking. Direct rating methods ask observers to assign scores to the observed stimuli. They may involve displaying a single stimulus (absolute category rating (ACR), single stimulus continuous quality evaluation (SSCQE)) or two stimuli (double stimulus impairment scale (DSIS), double stimulus continuous quality evaluation (DSCQE)). Ranking methods ask observers to compare two or more stimuli and order them according to their quality. The most commonly employed ranking method is pairwise comparison (PC).

Pairwise comparisons have been argued to be more suitable for collecting quality datasets because of the simplicity of the task and the consistency of the results [1, 2]. Those works, however, did not consider an important step in the analysis of pairwise comparison data: scaling the comparisons onto an interval quality scale. In this work, we analyze the importance of this step and demonstrate how it makes it possible to obtain a unified quality scale between rating and ranking methods.

The vast majority of studies employing the pairwise comparison method compare only images depicting the same content, for example different distortion levels applied to the same original image. This “apples-to-apples” comparison simplifies the observers’ task and makes the results consistent within each content. However, it also has limitations. First, assessing and scaling each content independently makes it difficult to obtain scores that correctly capture quality differences between conditions across different contents on a common quality scale. Second, pairwise comparisons capture only relative quality relations; in order to assign an absolute value to such relative measurements, the experimenter needs to assume a fixed quality for a certain condition, which is then used as the reference for the scaling. As a result, the scaling error accumulates as conditions get perceptually farther from the reference.
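To make the scaling step concrete, the sketch below maps a matrix of pairwise comparison counts onto an interval scale using a simple Thurstone Case V style least-squares formulation. This is only an illustrative sketch, not the exact scaling procedure used in the paper; the helper name scale_pairwise, the clipping of the preference probabilities, and the choice of anchoring one condition at zero are our assumptions. Note that cross-content comparisons need no special treatment here: they are simply additional non-zero entries in the count matrix that link conditions belonging to different contents.

```python
# Minimal scaling sketch (assumed, not the paper's exact procedure).
import numpy as np
from scipy.stats import norm

def scale_pairwise(counts, anchor=0):
    """Map a matrix of pairwise comparison counts onto an interval scale.

    counts[i, j] is the number of times stimulus i was preferred over
    stimulus j. Because PC data only constrain relative quality, the score
    of the `anchor` condition is pinned to zero.
    """
    n = counts.shape[0]
    trials = counts + counts.T          # total comparisons per pair
    rows, targets = [], []
    for i in range(n):
        for j in range(n):
            if i != j and trials[i, j] > 0:
                # Empirical preference probability, clipped so that the
                # inverse normal CDF stays finite (Thurstone Case V).
                p = np.clip(counts[i, j] / trials[i, j], 0.01, 0.99)
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0
                rows.append(row)
                targets.append(norm.ppf(p))   # z-score approximates q_i - q_j
    # Fix the free offset: one extra equation forcing q[anchor] = 0.
    row = np.zeros(n)
    row[anchor] = 1.0
    rows.append(row)
    targets.append(0.0)
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return q

# Toy usage: three conditions of a single content; condition 0 wins most often.
counts = np.array([[0., 14., 17.],
                   [6.,  0., 12.],
                   [3.,  8.,  0.]])
print(scale_pairwise(counts))   # interval-scale scores, condition 0 highest
```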
In this work, we study the effect of adding cross-content comparisons and show that doing so not only allows us to unify the quality scale across contents, but also significantly improves the accuracy of the scaled quality scores. To understand the effect of cross-content pairwise comparisons, we conduct three different experiments using the pairwise comparison and double stimulus impairment scale methodologies. There are three major findings in this paper:

• There is a strong linear relation between the mean opinion scores (MOS) obtained by direct rating and the scaled PC results (illustrated by the sketch at the end of this section);
• The addition of cross-content comparisons to traditional PC reduces error accumulation and increases accuracy when scaling PC results;
• Cross-content comparisons align the PC scaling results of different contents to a common quality scale, reducing content dependency.

For this study, we use the high dynamic range (HDR) video quality dataset presented in our previous work [3]. Detailed information on the scaling, the video quality database used, and the results is presented in the following sections.

Related work

There has been a substantial amount of work comparing different methodologies for subjective quality assessment. In [4], Pinson and Wolf compared single-stimulus and double-stimulus continuous quality evaluation methods (SSCQE and the double-stimulus continuous quality scale (DSCQS)) and found that the quality estimates are comparable to one another. In [5], ACR, DSIS, DSCQS, and SAMVIQ were compared. The authors found no significant differences between the compared methods. The methods were also ranked by assessment time and ease of evaluation: from fastest to slowest, the ranking was ACR, DSIS, SAMVIQ, and DSCQS. The ease-of-evaluation analysis yielded a similar result, with the exception that ACR with an 11-point scale was the hardest to evaluate with, whereas ACR with a 5-point scale was the easiest. SAMVIQ and ACR were further compared in [6], and SAMVIQ was found to require fewer subjects but more time than ACR. In the study of Mantiuk et al. [7], four different subjective methods were compared: single-stimulus categorical rating (absolute category rating with hidden reference (ACR-HR)), double-stimulus categorical rating, forced-choice pairwise comparison, and pairwise similarity judgments. No significant difference was found between the double-stimulus and single-stimulus methods, in agreement with the previous studies. The forced-choice pairwise comparison method was found to be the most accurate and to require the least experimental effort among the four compared methods.

The methodology of a subjective experiment depends on the intent and the research problem. Although direct rating methods obtain quality scores directly, ranking methods such as pairwise comparison offer additional preference information. There are several advantages to using the pairwise comparison methodology. Since observers are only expected to choose one stimulus of each pair (or “same” in some cases), PC does not require a quality scale. Observers can decide faster than with direct rating methods. And since the task is much more intuitive, the training of the subjects is simpler and less critical than for rating methods.
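As a small illustration of the first finding listed above, the snippet below fits a least-squares line between scaled PC scores and MOS, both pooled over all contents and separately per content. All numbers are hypothetical and do not come from the paper's experiments; with cross-content comparisons included in the scaling, the per-content fits would be expected to agree more closely with the pooled fit.

```python
# Illustrative check of the MOS vs. scaled-PC relation (hypothetical data).
import numpy as np

# Hypothetical scaled PC scores (interval scale, one value per condition)
# and the corresponding MOS for the same conditions: two contents, four
# distortion levels each.
pc_scores = np.array([0.0, -0.9, -2.1, -3.4,    # content A
                      0.2, -1.3, -2.6, -3.9])   # content B
mos       = np.array([4.7,  3.9,  2.8,  1.9,
                      4.5,  3.5,  2.5,  1.7])

# Single least-squares line MOS ~ a*q + b fitted over all conditions.
a, b = np.polyfit(pc_scores, mos, deg=1)
rmse = np.sqrt(np.mean((mos - (a * pc_scores + b)) ** 2))
print(f"pooled fit: slope={a:.2f}, intercept={b:.2f}, RMSE={rmse:.3f}")

# Per-content fits expose how content-dependent the mapping is.
for name, idx in (("A", slice(0, 4)), ("B", slice(4, 8))):
    ai, bi = np.polyfit(pc_scores[idx], mos[idx], deg=1)
    print(f"content {name}: slope={ai:.2f}, intercept={bi:.2f}")
```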



Publication date: 2018